Web Corpus Cleaning using Content and Structure

Authors

  • Katja Hofmann
  • Wouter Weerkamp
Abstract

This paper describes experiments on cleaning web corpora. While previously described approaches focus mainly on the visual representation of web pages, we evaluate approaches that rely on content and structure. We have evaluated a heuristics-based approach, as well as approaches based on decision trees, a genetic algorithm and language models. The best performance was achieved using the heuristics-based approach.
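A heuristics-based approach of the kind the abstract describes typically scores each text block by simple content and structure signals, such as length and the fraction of text inside links. The sketch below is illustrative only; the thresholds, feature names, and the `anchor_text_len` parameter are assumptions, not the paper's actual rules.

```python
import re

def link_density(block_text, anchor_text_len):
    """Fraction of a block's characters that sit inside anchor (<a>) tags."""
    total = len(block_text)
    return anchor_text_len / total if total else 1.0

def is_content_block(block_text, anchor_text_len,
                     min_words=10, max_link_density=0.3):
    """Heuristic: keep blocks that are long enough and not dominated by links."""
    words = re.findall(r"\w+", block_text)
    if len(words) < min_words:
        return False
    if link_density(block_text, anchor_text_len) > max_link_density:
        return False
    return True

# Example: a navigation bar versus a paragraph of body text.
nav = "Home About Contact"  # short block, fully linked
para = ("This paper describes experiments on cleaning web corpora "
        "using simple content and structure heuristics.")
print(is_content_block(nav, anchor_text_len=len(nav)))  # False
print(is_content_block(para, anchor_text_len=0))        # True
```

Real cleaners combine several such signals; the appeal of the heuristic approach, as the abstract's result suggests, is that even rules this simple separate boilerplate from body text surprisingly well.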


Similar papers

Victor: the Web-Page Cleaning Tool

In this paper we present a complete solution for automatic cleaning of arbitrary HTML pages, with the goal of using web data as a corpus in the area of natural language processing and computational linguistics. We employ a sequence-labeling approach based on Conditional Random Fields (CRF). Every block of text in the analyzed web page is assigned a set of features extracted from the textual content an...
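The CRF approach above hinges on turning each text block into a feature set before labeling. The per-block features below are a toy illustration of that step; the feature names and the `prev_text` context parameter are assumptions for the sketch, not Victor's actual feature set.

```python
import re

def block_features(text, prev_text=""):
    """Toy per-block features of the kind a CRF labeller might consume.
    Feature names are illustrative, not the tool's real feature set."""
    words = re.findall(r"\w+", text)
    return {
        "num_words": len(words),
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
        "ends_with_punct": text.rstrip().endswith((".", "!", "?")),
        "uppercase_ratio": sum(c.isupper() for c in text) / len(text) if text else 0.0,
        # Sequence models also look at neighbouring blocks:
        "prev_num_words": len(re.findall(r"\w+", prev_text)),
    }

feats = block_features("Cleaneval is a shared task and competitive evaluation.")
print(feats["num_words"])        # 8
print(feats["ends_with_punct"])  # True
```

A CRF then labels the whole sequence of blocks jointly (e.g. content vs. boilerplate), so a block's label can depend on its neighbours' features as well as its own.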


Web Page Cleaning with Conditional Random Fields

This paper describes the participation of the Charles University in Cleaneval 2007, the shared task and competitive evaluation of automatic systems for cleaning arbitrary web pages with the goal of preparing web data for use as a corpus in the area of computational linguistics and natural language processing. We try to solve this task as a sequence-labeling problem and our experimental system i...


Cleaneval: a Competition for Cleaning Web Pages

Cleaneval is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus for linguistic and language technology research and development. The first exercise took place in 2007. We describe how it was set up, the results, and the lessons learnt.


Building a Web Corpus of Czech

Large corpora are essential to modern methods of computational linguistics and natural language processing. In this paper, we describe an ongoing project whose aim is to build the largest corpus of Czech texts. We are building the corpus from Czech Internet web pages, using (and, if needed, developing) advanced downloading, cleaning and automatic linguistic processing tools. Our concern is to kee...


Data Preparation for Web Mining – A survey

An accepted trend is to categorize web mining into three main areas: web content mining, web structure mining and web usage mining. Web content mining involves extracting information from the contents of web pages and performing tasks such as knowledge synthesis. Web structure mining uses graph theory to understand website structure and hierarchy. Web usage mining involves the...




Journal:

Volume   Issue

Pages   -

Publication date: 2007